Abstract

We consider a multi-armed bandit problem where payoffs are a linear function of an observed stochastic contextual variable. In the scenario where there exists a gap between optimal and suboptimal rewards, several algorithms have been proposed that achieve $O(\log T)$ regret after $T$ time steps. However, proposed methods either have a computation complexity per iteration that scales linearly with $T$ or achieve regrets that grow linearly with the number of contexts $|\mathcal{X}|$. We propose an $\epsilon$-greedy type of algorithm that solves both limitations. In particular, when contexts are variables in $\mathbb{R}^d$, we prove that our algorithm has a constant computation complexity per iteration of $O(\mathrm{poly}(d))$ and can achieve a regret of $O(\mathrm{poly}(d)\log T)$ even when $|\mathcal{X}| = \Omega(2^d)$. In addition, unlike previous algorithms, its space complexity does not grow with $T$.



1 Introduction 

The contextual multi-armed bandit problem is a sequential learning problem. At each time step, a learner has to choose among a set of possible actions/arms $\mathcal{A}$. Prior to making its decision, the learner observes some additional side information $x \in \mathcal{X}$ over which it has no influence. This is commonly referred to as the context.

In general, the reward of a particular arm $a \in \mathcal{A}$ under context $x \in \mathcal{X}$ follows some unknown distribution. The goal of the learner is to select arms so that it minimizes its expected regret, i.e., the expected difference between its cumulative reward and the reward accrued by an optimal policy that knows the reward distributions.



Langford and Zhang (2007) propose an algorithm called epoch-Greedy for general contextual bandits. Their algorithm achieves an $O(\log T)$ regret in the number of timesteps $T$ in the stochastic setting, in which contexts are sampled from an unknown distribution in an i.i.d. fashion. Unfortunately, the proposed algorithm and subsequent improvements (Dudik et al., 2011) have high computational complexity. Selecting an arm at time step $t$ requires making a number of calls to a so-called optimization oracle that grows polynomially in $T$. In addition, implementing this optimization oracle can have a cost that grows linearly in $|\mathcal{X}|$ in the worst case; this is prohibitive in many interesting cases, including the case where $|\mathcal{X}|$ is exponential in the dimension of the context. In addition, both algorithms proposed in Langford and Zhang (2007) and Dudik et al. (2011) require keeping a history of observed contexts and arms chosen at every time instant. Hence, their space complexity grows linearly in $T$.

In this paper, we show that the challenges above can be addressed when rewards are linear. In the above contextual bandit setup, this means that $\mathcal{X}$ is a subset of $\mathbb{R}^d$, and the expected reward of an arm $a \in \mathcal{A}$ is an unknown linear function of the context $x$, i.e., it has the form $x^\top\theta_a$, for some unknown vector $\theta_a$. This is a case of great interest, arising naturally when, conditioned on $x$, rewards from different arms are uncorrelated.

Example 1. (Processor Scheduling) A simple example is assigning incoming jobs to a set of processors $\mathcal{A}$, whose processing capabilities are not known a priori. This could be the case if, e.g., the processors are machines in the cloud or, alternatively, humans offering their services through, e.g., Mechanical Turk. Each arriving job is described by a set of attributes $x \in \mathbb{R}^d$, each capturing the work load of different types of sub-tasks this job entails, e.g., computation, I/O, network communication, etc. Each processor's unknown feature vector $\theta_a$ describes its processing capacity, i.e., the time to complete a sub-task unit, in expectation. The expected time to complete a task $x$ is given by $x^\top\theta_a$; the goal of minimizing the delay (or, equivalently, maximizing its negation) brings us to the above bandit setting. □

Example 2. (Group Activity Selection) Another motivating example is maximizing group ratings, observed as the outcome of a secret ballot election. In this setup, a subset of $d$ users congregate to perform a joint activity, such as, e.g., dining, rock climbing, watching a movie, etc. The group is dynamic and, at each timestep $t \in \mathbb{N}$, the vector $x \in \{0,1\}^d$ is an indicator of present participants. An arm (i.e., a joint activity) is selected; at the end of the activity, each user votes whether they liked the activity or not in a secret ballot, and the final tally is disclosed. In this scenario, the unknown vectors $\theta_a \in \mathbb{R}^d$ indicate the probability a given participant will enjoy activity $a$, and the goal is to select activities that maximize the aggregate satisfaction among participants present at the given timestep. □

Our contributions are as follows. 

• We isolate and focus on the linear payoff case of stochastic multi-armed bandit problems, and design a simple arm selection policy which does not resort to the sophisticated oracles inherent in prior work.

• We prove that our policy achieves an $O(\log T)$ regret after $T$ steps in the stochastic setting, when the expected rewards of each arm are well separated. This meets the regret bound of the best known algorithms for contextual multi-armed bandit problems.

• We show that our algorithm has $O(|\mathcal{A}|d^3)$ computational complexity per step and its expected space complexity scales like $O(|\mathcal{A}|d^2)$. This is a significant improvement over known algorithms for contextual multi-armed bandit problems, as well as over bandit algorithms specialized for linear payoffs.



Our algorithm is inspired by the work of Auer et al. (2002) on the $\epsilon$-greedy algorithm and the use of linear regression to estimate the parameters $\theta_a$. The main technical innovation is the use of matrix concentration bounds to control the error of the estimates of $\theta_a$ in the stochastic setting. We believe that this is a powerful realization and may ultimately help us analyze richer classes of payoff functions.

The remainder of this paper is organized as follows: in Section 2 we compare our results with existing literature. In Section 3 we describe the setup of our problem in more detail. In Section 4 we state our main results and in Section 5 we discuss them. Section 6 is devoted to exemplifying the performance and limitations of our algorithm by means of simple numerical simulations. In Section 7 we prove our results. Finally, in Section 8 we draw our conclusions.

2 Related Work 



The original paper by Langford and Zhang (2007) assumes that, conditioned on the arm and the context, rewards are sampled from a probability distribution $p_{a,x}$. As is common in bandit problems, there is a tradeoff between exploration, i.e., selecting arms $a \in \mathcal{A}$ to sample rewards from the distributions $p_{a,x}$ and learn about them, and exploitation, whereby knowledge of these distributions based on the samples is used to select an arm that yields a high payoff.

A significant challenge is that during the exploitation phase, conditioned on the fact that an arm $a$ was chosen, the distribution of observed contexts does not follow $p(x \mid a)$. In fact, an arm will tend to be selected more often in contexts in which it performs well. The epoch-Greedy algorithm deals with this by separating the exploration and exploitation phases, effectively selecting an arm uniformly at random at certain time slots (the exploration "epochs"), and using samples collected only during these epochs to estimate the payoff of each arm in the remaining time slots (for exploitation). Langford and Zhang (2007) establish an $O(T^{2/3})$ bound, up to a logarithmic factor, on the regret for epoch-Greedy in the stochastic setting. They further improve this to $O(\log T)$ when a lower bound on the gap between optimal and suboptimal arms in each context exists.

Unfortunately, the price of the generality of the framework of Langford and Zhang (2007) is the high computational complexity when selecting an arm during an exploitation phase. In a recent improvement (Dudik et al., 2011), this computation requires a $\mathrm{poly}(t)$ number of calls to an optimization oracle. Most importantly, even in the linear case we study here, there is no clear way to implement this oracle in sub-exponential time in $d$, the dimension of the context. As Dudik et al. (2011) point out, the optimization oracle solves a so-called cost-sensitive classification problem. In the particular case of linear bandits, the oracle thus reduces to finding the "least-costly" linear classifier. This is hard, even in the case of only two arms: finding the linear classifier with the minimal number of errors is NP-hard (Johnson and Preparata, 1978), and remains NP-hard even if an approximate solution is required (Bartlett and Ben-David, 1999).


As such, a different approach is warranted under linear rewards. On the other hand, linear bandits have been extensively studied in the following setup, which is more general than ours. In the classic linear bandit setup, the arms themselves are represented as vectors, i.e., $\mathcal{A} \subset \mathbb{R}^d$, and, in addition, the set $\mathcal{A}$ can change from one time slot to the next. The expected payoff of an arm $a$ with vector $x_a$ is given by $x_a^\top\theta$, for some unknown vector $\theta \in \mathbb{R}^d$, common among all arms.

There are several different variants of the above model. Auer (2002) and, more recently, Li et al. (2010) and Chu et al. (2011), study this problem in the adversarial setting. In particular, $|\mathcal{A}|$ is fixed (and finite) and $\mathcal{A} \subset \mathbb{R}^d$ is given at each time by an adversary that has full knowledge of what the learner knows, but cannot a priori predict the outcome of any random variables before the learner observes them. Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), and Abbasi-Yadkori et al. (2011) study the problem in the stochastic setting, in cases where $\mathcal{A}$ is a fixed but possibly uncountable bounded subset of $\mathbb{R}^d$.

The regret bounds on all of the above setups (both stochastic and adversarial) are of the order of $O(\sqrt{T}\,\mathrm{polylog}(T))$. An important distinction between the aforementioned general linear bandit setup and the contextual model of Langford and Zhang (2007) (and, a fortiori, our model as well) is that in the above setting, different arms' payoffs are correlated. Payoffs observed for any arm inform the learner about the common unknown $\theta$ and, hence, help infer the payoff of a different arm. Exploiting this correlation to achieve low regret constitutes the main challenge of the above setups. In Langford and Zhang (2007), as in our case, the reward of an arm does not reveal any information about the reward of another arm. Nevertheless, the rewards for the same arm under different contexts are correlated. Exploiting this correlation to learn the unknown vectors $\theta_a$ faster and achieve low regret constitutes the main challenge of our setup.



Note that our problem can be expressed as a special case of the above linear bandits setup by taking $\theta = [\theta_1; \ldots; \theta_K] \in \mathbb{R}^{Kd}$, where $K = |\mathcal{A}|$, and, given context $x$, associating the $i$-th arm with an appropriate vector of the form $x_{a_i} = [0 \ldots x \ldots 0]$. As such, all of the above $O(\sqrt{T}\,\mathrm{polylog}(T))$ bounds (and respective algorithms) can be applied to our setup. However, these algorithms do not exploit the fact that, in our setting, arms are uncorrelated. We exploit this to obtain a logarithmic regret, for a much simpler algorithm than the ones outlined in the above works.

3 Model

In this section, we give a precise definition of the linear contextual bandit problem we study in this work.

3.1 Contexts

At every time instant $t \in \{1, 2, \ldots\}$, a context $x_t \in \mathcal{X} \subset \mathbb{R}^d$ is observed by the learner. We assume that $\|x\|_2 \le 1$; as the expected reward is linear in $x$, this assumption is w.l.o.g. We prove our main result (Theorem 2) in the stochastic setting where $x_t$ are drawn i.i.d. from an unknown multivariate probability distribution $\mathcal{D}$.

In addition, we require that the set of contexts is finite, i.e., $|\mathcal{X}| < \infty$. We define $\Sigma_{\min} > 0$ to be the smallest non-zero eigenvalue of the covariance matrix $\Sigma = E\{x_1 x_1^\top\}$.

3.2 Arms and actions

At time $t$, after observing the context $x_t$, the learner decides to play an arm $a \in \mathcal{A}$, where $K = |\mathcal{A}|$ is finite. We denote the arm played at this time by $a_t$. We study adaptive arm selection policies, whereby the selection of $a_t$ depends only on the current context $x_t$, and on all past contexts, actions and rewards. In other words, $a_t = a_t\big(x_t, \{x_\tau, a_\tau, r_\tau\}_{\tau=1}^{t-1}\big)$.

3.3 Payoff

After observing a context $x_t$ and selecting an arm $a_t$, the learner receives a payoff $r_{a_t,t}$ which is drawn from a distribution $p_{a_t,x_t}$ independently of all past contexts, actions or payoffs. We assume that the expected payoff is a linear function of the context. In other words,

$$r_{a,t} = x_t^\top\theta_a + \epsilon_{a,t}, \tag{1}$$

where $\{\epsilon_{a,t}\}_{a \in \mathcal{A},\, t \ge 1}$ are a set of independent random variables with zero mean and $\{\theta_a\}_{a \in \mathcal{A}}$ are unknown parameters in $\mathbb{R}^d$. Note that, w.l.o.g., we can assume that $Q = \max_{a \in \mathcal{A}} \|\theta_a\|_2 \le 1$. This is because if $Q > 1$, as payoffs are linear, we can divide all payoffs by $Q$; the resulting payoff is still a linear model, and our results stated below apply.

Recall that $Z$ is a sub-gaussian random variable with constant $L$ if $E\{e^{\gamma Z}\} \le e^{\gamma^2 L^2/2}$ for all $\gamma \in \mathbb{R}$. In particular, sub-gaussianity implies $E\{Z\} = 0$. We make the following technical assumption.

Assumption 1 The random variables $\{\epsilon_{a,t}\}_{a \in \mathcal{A},\, t \ge 1}$ are sub-gaussian random variables with constant $L > 0$.

3.4 Regret

Given a context $x$, the arm that gives the highest expected reward is

$$a^*_x = \arg\max_{a \in \mathcal{A}} x^\top\theta_a. \tag{2}$$



The expected cumulative regret the learner experiences over $T$ steps is defined by

$$R(T) = E\Big\{\sum_{t=1}^{T} x_t^\top\big(\theta_{a^*_{x_t}} - \theta_{a_t}\big)\Big\}. \tag{3}$$

The expectation above is taken over the contexts $x_t$.



The objective of the learner is to design a policy $a_t = a_t\big(x_t, \{x_\tau, a_\tau, r_\tau\}_{\tau=1}^{t-1}\big)$ that achieves as low an expected cumulative regret as possible. In this paper we are also interested in arm selection policies having a low computational complexity.

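We define $\Delta_{\max} = \max_{a,b \in \mathcal{A}} \|\theta_a - \theta_b\|_2$, and

$$\Delta_{\min} = \inf_{x \in \mathcal{X},\; a:\, x^\top\theta_a < x^\top\theta_{a^*_x}} x^\top\big(\theta_{a^*_x} - \theta_a\big) > 0.$$

Observe that, by the finiteness of $\mathcal{X}$ and $\mathcal{A}$, the above infimum is attained (i.e., it is a minimum) and is indeed positive.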
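For a finite set of contexts and known parameters, both gap quantities can be computed by direct enumeration. The following is a minimal sketch under that assumption; the function name `gaps` and the numerical tolerance used to decide that an arm is strictly suboptimal are illustrative choices, not part of the paper.

```python
import numpy as np

def gaps(contexts, thetas):
    """Return (delta_max, delta_min) for finite contexts and per-arm parameters.

    contexts: array of shape (m, d), one context per row.
    thetas:   array of shape (K, d), one parameter vector per arm.
    """
    contexts = np.asarray(contexts, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    # Delta_max: largest Euclidean distance between two arm parameters.
    diffs = thetas[:, None, :] - thetas[None, :, :]
    delta_max = np.linalg.norm(diffs, axis=2).max()
    # Expected reward of every arm under every context, shape (m, K).
    rewards = contexts @ thetas.T
    best = rewards.max(axis=1, keepdims=True)
    sub = best - rewards              # suboptimality gaps, all >= 0
    positive = sub[sub > 1e-12]       # strictly suboptimal (context, arm) pairs
    # Assumed convention: return infinity when no arm is ever suboptimal.
    delta_min = positive.min() if positive.size else np.inf
    return delta_max, delta_min
```

Enumeration costs $O(|\mathcal{X}| K d)$ operations, so it is only practical as a sanity check on small instances.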



4 Main results 



We now present a simple and efficient on-line algorithm that, under the above assumptions, has expected logarithmic regret. Specifically, its computational complexity, at each time instant, is $O(Kd^3)$ and the expected memory requirement scales like $O(Kd^2)$. As far as we know, our analysis is the first to show that a simple and efficient algorithm for the problem of linearly parametrized bandits can, under reward separation and i.i.d. contexts, achieve logarithmic expected cumulative regret.

Before we present our algorithm in full detail, let us give some intuition about it. Part of the job of the learner is to estimate the unknown parameters $\theta_a$ based on past actions, contexts and rewards. We denote the estimate of $\theta_a$ at time $t$ by $\hat\theta_{a,t}$. If $\theta_a \approx \hat\theta_{a,t}$ then, given an observed context, the learner will more accurately know which arm to play to incur a small regret. The estimates $\hat\theta_{a,t}$ can be constructed based on a history of past rewards, contexts and arms played.

Since observing a reward $r$ for arm $a$ under context $x$ does not give information about the magnitude of $\theta_a$ along directions orthogonal to $x$, it is important that, for each arm, rewards are observed and recorded for a rich class of contexts. This gives rise to the following challenge: if the learner tries to build this history while trying to minimize the regret, the distribution of contexts observed when playing a certain arm $a$ will be biased and potentially not rich enough. In particular, when trying to achieve a small regret, conditioned on $a_t = a$, it is more likely that $x_t$ is a context for which $a$ is optimal.

We address this challenge using the following idea, also appearing in the epoch-Greedy algorithm of Langford and Zhang (2007). We partition time slots into exploration and exploitation epochs. In exploration epochs, the learner plays arms uniformly at random, independently of the context, and records the observed rewards. This guarantees that in the history of past events, each arm has been played along with a sufficiently rich set of contexts. In exploitation epochs, the learner makes use of the history of events stored during exploration to estimate the parameters $\theta_a$ and determine which arm to play given a current observed context. The rewards observed during exploitation are not recorded.



More specifically, when exploiting, the learner performs two operations. In the first operation, for each arm $a \in \mathcal{A}$, an estimate $\hat\theta_a$ of $\theta_a$ is constructed from a simple $\ell_2$-regularized regression, as in Auer (2002) and Li et al. (2010). In the second operation, the learner plays the arm $a$ that maximizes $x_t^\top\hat\theta_a$. Crucially, in the first operation, only information collected during exploration epochs is used. In particular, let $\mathcal{T}_{a,t-1}$ be the set of exploration epochs up to and including time $t-1$ (i.e., the times that the learner played an arm u.a.r.). Moreover, for any $\mathcal{T} \subset \mathbb{N}$, denote by $r_\mathcal{T} \in \mathbb{R}^{|\mathcal{T}|}$ the vector of observed rewards for all time instances $t \in \mathcal{T}$, and by $X_\mathcal{T} \in \mathbb{R}^{|\mathcal{T}| \times d}$ the matrix of $|\mathcal{T}|$ rows, each containing one of the observed contexts at time $t \in \mathcal{T}$. Then, at timeslot $t$ the estimator $\hat\theta_a$ is the solution of the following convex optimization problem.



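$$\min_{\theta \in \mathbb{R}^d} \; \frac{1}{2n}\big\|r_\mathcal{T} - X_\mathcal{T}\theta\big\|_2^2 + \frac{\lambda_n}{2}\|\theta\|_2^2, \tag{4}$$

where $\mathcal{T} = \mathcal{T}_{a,t-1}$, $n = |\mathcal{T}_{a,t-1}|$, $\lambda_n = 1/\sqrt{n}$. In other words, the estimator $\hat\theta_a$ is a (regularized) estimate of $\theta_a$, based only on observations made during exploration epochs. Note that the solution to (4) is given by

$$\hat\theta_a = \Big(\lambda_n I + \frac{1}{n}X_\mathcal{T}^\top X_\mathcal{T}\Big)^{-1}\frac{1}{n}X_\mathcal{T}^\top r_\mathcal{T}. \tag{5}$$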
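The closed form can be checked numerically. The sketch below assumes the regularized least-squares objective $\frac{1}{2n}\|r - X\theta\|_2^2 + \frac{\lambda_n}{2}\|\theta\|_2^2$ with $\lambda_n = 1/\sqrt{n}$, which is the scaling that matches the linear system solved in Algorithm 1; the function names are hypothetical.

```python
import numpy as np

def ridge_estimate(X, r):
    """Solve (lambda_n I + X^T X / n) theta = X^T r / n with lambda_n = 1/sqrt(n)."""
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    n, d = X.shape
    lam = 1.0 / np.sqrt(n)
    M = lam * np.eye(d) + X.T @ X / n
    # A linear solve is used instead of forming the inverse explicitly.
    return np.linalg.solve(M, X.T @ r / n)

def objective_gradient(theta, X, r):
    """Gradient of (1/2n)||r - X theta||^2 + (lambda_n/2)||theta||^2 at theta."""
    n = X.shape[0]
    lam = 1.0 / np.sqrt(n)
    return -X.T @ (r - X @ theta) / n + lam * theta
```

Because the objective is strictly convex, a vanishing gradient at the returned vector confirms that it is the unique minimizer.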



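Algorithm 1 Contextual $\epsilon$-greedy

For all $a \in \mathcal{A}$, set $A_a \leftarrow 0_{d \times d}$; $n_a \leftarrow 0$; $b_a \leftarrow 0_d$
for $t = 1$ to $p$ do
    $a \leftarrow 1 + (t \bmod K)$
    Play arm $a$
    $n_a \leftarrow n_a + 1$; $b_a \leftarrow b_a + r_t x_t$; $A_a \leftarrow A_a + x_t x_t^\top$
end for
for $t = p + 1$ to $T$ do
    $e \leftarrow \mathrm{Bernoulli}(p/t)$
    if $e = 1$ then
        $a \leftarrow \mathrm{Uniform}(1/K)$
        Play arm $a$
        $n_a \leftarrow n_a + 1$; $b_a \leftarrow b_a + r_t x_t$; $A_a \leftarrow A_a + x_t x_t^\top$
    else
        for $a \in \mathcal{A}$ do
            Get $\hat\theta_a$ as the solution to the linear system:
            $\big(\lambda_{n_a} I + \frac{1}{n_a} A_a\big)\hat\theta_a = \frac{1}{n_a} b_a$
        end for
        Play arm $a_t = \arg\max_{a \in \mathcal{A}} x_t^\top\hat\theta_a$
    end if
end for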
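The pseudocode above can be turned into a short runnable sketch. The class below is an illustration only, assuming NumPy, zero-based arm indices, $\lambda_n = 1/\sqrt{n}$, and a split into `select` and `update` calls; the class and method names are hypothetical and not taken from the paper.

```python
import numpy as np

class ContextualEpsilonGreedy:
    """Sketch of Algorithm 1; arms are indexed 0..K-1 instead of 1..K."""

    def __init__(self, K, d, p, seed=0):
        self.K, self.d, self.p = K, d, p
        self.A = np.zeros((K, d, d))     # per-arm sum of x x^T over exploration steps
        self.b = np.zeros((K, d))        # per-arm sum of r x over exploration steps
        self.n = np.zeros(K, dtype=int)  # per-arm number of recorded samples
        self.t = 0
        self.explored = False            # whether the last decision was exploration
        self.rng = np.random.default_rng(seed)

    def estimate(self, a):
        n = max(int(self.n[a]), 1)       # guard for an arm with no samples yet
        lam = 1.0 / np.sqrt(n)
        M = lam * np.eye(self.d) + self.A[a] / n
        return np.linalg.solve(M, self.b[a] / n)

    def select(self, x):
        self.t += 1
        if self.t <= self.p:                         # initial round-robin phase
            self.explored = True
            return self.t % self.K
        if self.rng.random() < self.p / self.t:      # explore with probability p/t
            self.explored = True
            return int(self.rng.integers(self.K))
        self.explored = False                        # exploit current estimates
        scores = [x @ self.estimate(a) for a in range(self.K)]
        return int(np.argmax(scores))

    def update(self, a, x, r):
        if self.explored:                # exploitation rewards are not recorded
            self.n[a] += 1
            self.b[a] += r * x
            self.A[a] += np.outer(x, x)
```

A caller alternates `a = agent.select(x)` and `agent.update(a, x, r)`. The stored state is `A`, `b` and `n` only, so its size is independent of the horizon.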



An important design choice in the above process is the selection of the time slots at which the algorithm explores, rather than exploits. Following the ideas of Sutton and Barto (1998), we select the exploration epochs so that they occur approximately $O(\log t)$ times after $t$ slots. This guarantees that, at each time step, there is enough information in our history of past events to determine the parameters accurately while only incurring a regret of $O(\log t)$. There are several ways of achieving this; our algorithm explores at each time step with probability $O(t^{-1})$.
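To see why a $p/t$ exploration probability yields a logarithmic number of exploration steps, one can sum the probabilities directly. The sketch below assumes the schedule of Algorithm 1 (forced exploration for the first $p$ steps, then probability $p/t$); the function name is illustrative.

```python
import math

def expected_explorations(p, T):
    """Expected number of exploration steps by time T under the p/t schedule."""
    forced = min(p, T)                                   # first p steps always explore
    random_part = sum(p / t for t in range(p + 1, T + 1))
    return forced + random_part
```

Since $\sum_{t=p+1}^{T} 1/t \le \log(T/p)$, the returned value is at most $p + p\log(T/p)$, which grows logarithmically in $T$ for fixed $p$.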



The above steps are summarized in pseudocode by Algorithm 1. Note that the algorithm contains a scaling parameter $p$, which is specified below, in Theorem 2. Because there are $K$ arms and for each arm $(x_t, r_{a,t}) \in \mathbb{R}^{d+1}$, the expected memory required by the algorithm scales like $O(Kd^2)$. In addition, both the matrix $X_\mathcal{T}^\top X_\mathcal{T}$ and the vector $X_\mathcal{T}^\top r_\mathcal{T}$ can be computed in an online fashion in $O(d^2)$ time: $X_\mathcal{T}^\top X_\mathcal{T} \leftarrow X_\mathcal{T}^\top X_\mathcal{T} + x_t x_t^\top$ and $X_\mathcal{T}^\top r_\mathcal{T} \leftarrow X_\mathcal{T}^\top r_\mathcal{T} + r_t x_t$. Finally, the estimate of $\theta_a$ does not require full matrix inversion but only solving a linear system (see Algorithm 1), which can be done in $O(d^3)$ time. The above is summarized in the following theorem.

Theorem 1 Algorithm 1 has computational complexity of $O(Kd^3)$ per iteration and its expected space complexity scales like $O(Kd^2)$.

We now state our main theorem that shows that Algorithm 1 achieves $R(T) = O(\log T)$.

Theorem 2 Under Assumption 1, the expected cumulative regret of Algorithm 1 satisfies,

R{T) < pA 

max 

for any 

CKL'^ 



Above, $C$ is a universal constant, $\Delta'_{\min} = \min\{1, \Delta_{\min}\}$, $\Sigma'_{\min} = \min\{1, \Sigma_{\min}\}$ and $L' = \max\{1, L\}$.

Algorithm 1 requires the specification of the constant $p$. In Section 4.2 we give two examples of how to efficiently choose a $p$ that satisfies (6).

In Theorem 2 the bound on the regret depends on $p$ - small $p$ is preferred - and hence it is important to understand how the r.h.s. of (6) might scale when $K$ and $d$ grow. In Section 4.1 we show that, for a concrete distribution of contexts and choice of expected rewards $\theta_a$, and assuming (6)
holds, p = 0{K^d^)^. There is nothing special about the 
concrete details of how contexts and $\theta_a$'s are chosen and, although not included in this paper, for many other distributions, one also obtains $p = O(\mathrm{poly}(d))$. We can certainly construct pathological cases where, for example, $p$ grows exponentially with $d$. However, we do not find these natural, especially when they are interpreted with real applications in mind, such as the ones introduced in Example 1 and Example 2.

4.1 Example of scaling of p with d and K

Assume that contexts are obtained by normalizing a $d$-dimensional vector with i.i.d. entries as Bernoulli random variables with parameter $w$. Assume in addition that every $\theta_a$ is obtained i.i.d. from the following prior distribution: every entry of $\theta_a$ is drawn i.i.d. from a uniform distribution and then $\theta_a$ is normalized. Finally, assume that the payoffs are given by $r_{a,t} = x_t^\top\Theta_a$, where $\Theta_a \in \mathbb{R}^d$ are random variables that fluctuate around $\theta_a = E\{\Theta_a\}$ with each entry fluctuating by at most $F$.

Under these assumptions the following is true: 

• $\Sigma_{\min} = \Omega(d^{-1})$. In fact, the same result holds asymptotically independently of $w = w(d)$ if, for example, we assume that on average groups are roughly of the same size, $M$, with $w \sim M/d$.

• $L = O(\sqrt{d})$. This holds because $\epsilon_{a,t} = r_{a,t} - E\{r_{a,t}\} = x_t^\top(\Theta_a - \theta_a)$ are bounded random variables with zero mean and $\|x_t^\top(\Theta_a - \theta_a)\|_\infty = O(\sqrt{d})$.

• $\Delta_{\min} = \Omega(1/(Kd\sqrt{w}))$ with high probability (for large $K$ and $d$). This can be seen as follows: if $\Delta_{\min} = x^\top(\theta_a - \theta_b)$ for some $x$, $a$ and $b$, then it must be true that $\theta_a$ and $\theta_b$ differ in a component for which $x$ is non-zero. The minimum difference between components among all pairs of $\theta_a$ and $\theta_b$ is lower bounded by $\Omega(1/(K\sqrt{d}))$ with high probability (for large $K$ and $d$). Taking into account that each entry of $x$ is of order $O(1/\sqrt{dw})$ with high probability, the bound on $\Delta_{\min}$ follows.

If we want to apply Theorem |2] then ^ must hold and 
hence putting all the above calculations together we con- 
clude thatp = 0{K'^d^). 

4.2 Computing p in practice 

If we have knowledge of an a priori distribution for the contexts, for the expected payoffs and for the variance of the rewards then we can quickly compute the value of $\Sigma_{\min}$, $L$ and a typical value for $\Delta_{\min}$. An example of this was done above (Section 4.1). There, the values were presented only in order notation but exact values are not hard to obtain for that and other distributions. Since a suitable $p$ only needs to be larger than the r.h.s. of (6), by introducing an appropriate multiplicative constant, we can produce a $p$ that satisfies (6) with high probability.

If we have no knowledge of any model for the contexts or expected payoffs, it is still possible to find $p$ by estimating $\Delta_{\min}$, $\Sigma_{\min}$ and $L$ from data gathered while running Algorithm 1. Notice again that, since all that is required for our theorem to hold is that $p$ is greater than a certain function of these quantities, an exact estimation is not necessary. This is important because, for example, accurately estimating $\Sigma_{\min}$ is hard when the matrix $E\{x_1 x_1^\top\}$ has a large condition number.



¹This bound holds with probability converging to 1 as $K$ and $d$ get large.



Not being too concerned about accuracy, $\Sigma_{\min}$ can be estimated from $E\{x_1 x_1^\top\}$, which can be estimated from the sequence of observed $x_t$. $\Delta_{\min}$ can be estimated from Algorithm 1 by keeping track of the smallest difference observed until time $t$ between $\max_b x_t^\top\hat\theta_b$ and the second largest value of the function being maximized. Finally, the constant $L$ can be estimated from the variance of the observed rewards for the same (or similar) contexts. Together, these estimations do not incur any significant loss in computational performance of our algorithm.
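One possible way to maintain these running estimates is sketched below. It assumes that the per-arm estimated rewards are available at each step, that eigenvalues below a small tolerance count as zero, and that ties between the two best arms are ignored; the class name `ParameterEstimator` and the tolerance are hypothetical choices rather than the paper's procedure.

```python
import numpy as np

class ParameterEstimator:
    """Running estimates of Sigma_min and Delta_min (illustrative helper)."""

    def __init__(self, d, tol=1e-9):
        self.S = np.zeros((d, d))    # running sum of x x^T
        self.count = 0
        self.gap = np.inf            # smallest positive top-two gap seen so far
        self.tol = tol

    def observe(self, x, scores):
        """x: current context; scores: estimated rewards x^T theta_hat_b, one per arm."""
        x = np.asarray(x, dtype=float)
        self.S += np.outer(x, x)
        self.count += 1
        top_two = np.sort(np.asarray(scores, dtype=float))[-2:]
        diff = top_two[1] - top_two[0]
        if diff > self.tol:
            self.gap = min(self.gap, diff)

    def sigma_min(self):
        """Smallest non-zero eigenvalue of the empirical second-moment matrix."""
        eig = np.linalg.eigvalsh(self.S / self.count)
        return eig[eig > self.tol].min()

    def delta_min(self):
        return self.gap
```

Each call to `observe` costs $O(d^2 + K \log K)$; the eigenvalue computation costs $O(d^3)$ and only needs to run when $p$ is re-evaluated.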

5 Adversarial setting 

In the stochastic setting, the richness of the subset of $\mathbb{R}^d$ spanned by the observed contexts is related to the skewness of the distribution $\mathcal{D}$. The fact that our result depends on $\Sigma_{\min}$ and that the regret increases as this value becomes smaller indicates that we are still far from proving the $O(\log T)$ regret for the adversarial setting, where an adversary chooses the contexts.

In particular, the main difficulty in using a linear regression, and the reason why our result depends on $\Sigma_{\min}$, is related to the dependency of our estimation of $x_t^\top\theta_a$ on $\frac{1}{n} X_\mathcal{T}^\top X_\mathcal{T}$, where $\mathcal{T} = \mathcal{T}_{a,t-1}$ and $n = |\mathcal{T}|$. It is not hard to show that the error in approximating $x_t^\top\theta_a$ with $x_t^\top\hat\theta_a$ is proportional to

$$x_t^\top\Big(\lambda_n I + \frac{1}{n} X_\mathcal{T}^\top X_\mathcal{T}\Big)^{-1} x_t. \tag{7}$$

Looking at this expression one quickly sees that, even if a given context has been observed relatively often in the past, the algorithm can "forget" it because of the mean over contexts that is being used to produce our estimates of $x_t^\top\theta_a$.

The effect of this phenomenon on the performance of Algorithm 1 can be readily seen in the following pathological example. Assume that $\mathcal{X} = \{(1,1), (1,0)\} \subset \mathbb{R}^2$. Assume that the contexts arrive in the following way: $(1,1)$ appears with probability $1/l$ and $(1,0)$ appears with probability $1 - 1/l$. The correlation matrix for this stochastic process is $\{(1, 1/l), (1/l, 1/l)\}$ and its minimum eigenvalue scales like $O(1/l)$. Hence, the regret scales as $O(l^2 \log T)$. If $l$ is allowed to slowly grow with $t$, we expect that our algorithm will not be able to guarantee a logarithmic regret (assuming that our upper bound is tight). In other words, although $(1,1)$ might have appeared a sufficient number of times for us to be able to predict the expected reward for this context, Algorithm 1 performs poorly since the mean in (7) will be 'saturated' with the context $(1,0)$ and forget about $(1,1)$.

The solution for this problem is to ignore some past contexts when building an estimate for $x_t^\top\theta_a$, by including in the mean in (7) only past contexts that are closer in direction to the current context $x_t$. Having this in mind, and building on the ideas of Auer (2002), we propose the UCB-type Algorithm 2.

Algorithm 2 Contextual UCB

for $t = 1$ to $p$ do
    $a \leftarrow 1 + (t \bmod K)$
    Play arm $a$
    $\mathcal{T}_{a,t} \leftarrow \mathcal{T}_{a,t-1} \cup \{t\}$
end for
for $t = p + 1$ to $T$ do
    for $a \in \mathcal{A}$ do
        $c_{a,t} \leftarrow \min_{\mathcal{T} \subset \mathcal{T}_{a,t-1}} \frac{\log t}{|\mathcal{T}|}\, x_t^\top\big(\lambda_n I + \frac{1}{n} X_\mathcal{T}^\top X_\mathcal{T}\big)^{-1} x_t$
        $\mathcal{T}^* \leftarrow$ subset of $\mathcal{T}_{a,t-1}$ that achieves the minimum
        Get $\hat\theta_a$ as the solution to the linear system:
        $\big(\lambda_n I + \frac{1}{n} X_{\mathcal{T}^*}^\top X_{\mathcal{T}^*}\big)\hat\theta_a = \frac{1}{n} X_{\mathcal{T}^*}^\top r_{\mathcal{T}^*}$
    end for
    Play arm $a = \arg\max_b\, x_t^\top\hat\theta_b + \sqrt{c_{b,t}}$
    $\mathcal{T}_{a,t} \leftarrow \mathcal{T}_{a,t-1} \cup \{t\}$
end for



It is straightforward to notice that this algorithm cannot be implemented in an efficient way. In particular, the search for $\mathcal{T}^* \subset \mathcal{T}_{a,t-1}$ has a computational complexity exponential in $t$. The challenge is to find an efficient way of approximating $\mathcal{T}^*$. This can be done by either reducing the size of $\mathcal{T}_{a,t-1}$ - the history from which one wants to extract $\mathcal{T}^*$ - by not storing all events in memory², or by finding an efficient algorithm for approximating the minimization over $\mathcal{T}_{a,t-1}$ (or both). It remains an open problem to find such an approximation scheme and to prove that it achieves $O(\log T)$ regret.

6 Numerical results 

In Theorem 2 we showed that, in the stochastic setting, Algorithm 1 has an expected regret of order $O(\log T)$. We now illustrate this point by numerical simulations and, most importantly, exemplify how violating the stochastic assumption might degrade its performance.

Figure 1 shows the average cumulative regret (in semi-log scale) over 10 independent runs of Algorithm 1 for T = 10^ and for the following setup. The context variables $x \in \mathbb{R}^3$ and at each time step $\{x_t\}_{t \ge 1}$ are drawn i.i.d. in the following way: (a) set each entry of $x$ to 1 or 0 independently with probability 1/2; (b) normalize $x$. We consider $K = 6$ arms with corresponding parameters $\theta_a$ generated independently from a standard multivariate gaussian distribution. Given a context $x$ and an arm $a$, rewards were random and independently generated from a uniform distribution $U([0, 2x^\top\theta_a])$. As expected, the regret is logarithmic. Figure 1 shows a straight line at the end.

²For example, if we can guarantee that $|\mathcal{T}_{a,t}| = O(\log t)$ then the complexity of the above algorithm at time step $t$ is $O(t)$.

Figure 1: Regret over $T$ when $x_t$ is i.i.d.
Figure 2: Regret over $T$ when $x_t$ is not i.i.d.



To understand the effect of the stochasticity of $x$ on the regret, we consider the following scenario: with every other parameter unchanged, let $\mathcal{X} = \{x, x'\}$. At every time step $x = [1, 1, 1]$ appears with probability $1/l$, and $x' = [1, 0, 1]$ appears with probability $1 - (1/l)$. Figure 2 shows the dependency of the expected regret on the context distribution for $l = 5$, 10 and 100. One can see that an increase of $l$ causes a proportional increase in the regret.
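The role of $l$ in this scenario can be made concrete by computing the smallest non-zero eigenvalue of the second-moment matrix of the two-context distribution. The sketch below assumes the contexts are used unnormalized, exactly as written above, and treats eigenvalues below a small tolerance as zero; the function name is illustrative.

```python
import numpy as np

def smallest_nonzero_eigenvalue(l, tol=1e-12):
    """Smallest non-zero eigenvalue of E{x x^T} for the two-context distribution."""
    x = np.array([1.0, 1.0, 1.0])          # appears with probability 1/l
    x_prime = np.array([1.0, 0.0, 1.0])    # appears with probability 1 - 1/l
    sigma = np.outer(x, x) / l + (1.0 - 1.0 / l) * np.outer(x_prime, x_prime)
    eig = np.linalg.eigvalsh(sigma)
    # The matrix has rank 2, so one eigenvalue is (numerically) zero and is dropped.
    return eig[eig > tol].min()
```

For $l = 5$, 10 and 100 the returned value lies between $0.5/l$ and $1/l$, so it shrinks roughly like $1/l$, which is consistent with the regret growing as the rare context becomes rarer.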

7 Proofs 

7.1 Proof of Theorem 2

The general structure of the proof of our main result follows that of Auer et al. (2002). The main technical innovation is the realization that, in the setting when the contexts are drawn i.i.d. from some distribution, a standard matrix concentration bound allows us to treat $\lambda_n I + n^{-1}(X_\mathcal{T}^\top X_\mathcal{T})$ in Algorithm 1 as a deterministic positive-definite symmetric matrix, even as $\lambda_n \to 0$.

Let $\mathcal{E}_T$ denote the time instances $t$, for $t > p$ and until time $T$, in which the algorithm took an exploration decision. Recall that, by the Cauchy-Schwarz inequality, $x_t^\top(\theta_{a^*_{x_t}} - \theta_{a_t}) \le \sqrt{d}\,\Delta_{\max}$. In addition, recall that $\sum_{t=2}^{T} 1/t \le \log T$. For $R(T)$ the cumulative regret until time $T$, we can write,



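$$\begin{aligned}
R(T) &\le p\,\Delta_{\max}\sqrt{d} + \Delta_{\max}\sqrt{d}\,E\{|\mathcal{E}_T|\} + \Delta_{\max}\sqrt{d}\,E\Big\{\sum_{t \notin \mathcal{E}_T,\; p < t \le T} 1\{a_t \ne a^*_{x_t}\}\Big\} \\
&\le p\,\Delta_{\max}\sqrt{d} + p\,\Delta_{\max}\sqrt{d}\,\log T + \Delta_{\max}\sqrt{d}\,E\Big\{\sum_{t \notin \mathcal{E}_T,\; p < t \le T} 1\{a_t \ne a^*_{x_t}\}\Big\} \\
&\le p\,\Delta_{\max}\sqrt{d} + p\,\Delta_{\max}\sqrt{d}\,\log T + \Delta_{\max}\sqrt{d}\,E\Big\{\sum_{t \notin \mathcal{E}_T,\; p < t \le T}\; \sum_{a \ne a^*_{x_t}} 1\{x_t^\top\hat\theta_{a,t} > x_t^\top\hat\theta_{a^*_{x_t},t}\}\Big\}.
\end{aligned}$$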



In the last line we used the fact that when exploiting, if we do not exploit the optimal arm $a^*_{x_t}$, then it must be the case that the estimated reward for some arm $a$, $x_t^\top\hat\theta_{a,t}$, exceeds that of the optimal arm, $x_t^\top\hat\theta_{a^*_{x_t},t}$, for the current context $x_t$. Above, we use $\hat\theta_{a,t}$ instead of $\hat\theta_a$ to explicitly state the dependency of estimators on the stored history for each arm until time $t$.

We can continue the chain of inequalities and write, 

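$$R(T) \le p\,\Delta_{\max}\sqrt{d} + p\,\Delta_{\max}\sqrt{d}\,\log T + \Delta_{\max}\sqrt{d}\,\sum_{t=1}^{T}\; \sum_{a \ne a^*_{x_t}} P\big\{x_t^\top\hat\theta_{a,t} > x_t^\top\hat\theta_{a^*_{x_t},t}\big\}.$$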

The above expression depends on the value of the estimators for time instances that might or might not be exploitation times. For each arm, these are computed just like in Algorithm 1, using the most recent history available. The above probability depends on the randomness of $x_t$ and on the randomness of the recorded history for each arm.



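Since $x_t^\top(\theta_{a^*_{x_t}} - \theta_a) \ge \Delta_{\min}$ we can write

$$P\big\{x_t^\top\hat\theta_{a,t} > x_t^\top\hat\theta_{a^*_{x_t},t}\big\} \le P\Big\{x_t^\top(\hat\theta_{a,t} - \theta_a) \ge \frac{\Delta_{\min}}{2}\Big\} + P\Big\{x_t^\top(\hat\theta_{a^*_{x_t},t} - \theta_{a^*_{x_t}}) \le -\frac{\Delta_{\min}}{2}\Big\}.$$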



We now bound each of these probabilities separately. Since their bound is the same, we focus only on the first probability.



Substituting the definition of $r_{a,t} = x_t^\top\theta_a + \epsilon_{a,t}$ into the expression for $\hat\theta_{a,t}$ one readily obtains,
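A minimal worked version of this substitution is sketched below. It assumes the regularized least-squares closed form that matches the linear system solved in Algorithm 1, with $\mathcal{T} = \mathcal{T}_{a,t-1}$ and $n = |\mathcal{T}|$; the symbol $\epsilon_{a,\mathcal{T}}$, denoting the vector of noise terms at the times in $\mathcal{T}$, is notation introduced here for illustration.

```latex
\begin{align*}
\hat\theta_{a,t}
  &= \Big(\lambda_n I + \tfrac{1}{n} X_{\mathcal{T}}^\top X_{\mathcal{T}}\Big)^{-1}
     \tfrac{1}{n} X_{\mathcal{T}}^\top r_{\mathcal{T}},
  \qquad r_{\mathcal{T}} = X_{\mathcal{T}}\theta_a + \epsilon_{a,\mathcal{T}}, \\
\hat\theta_{a,t}
  &= \Big(\lambda_n I + \tfrac{1}{n} X_{\mathcal{T}}^\top X_{\mathcal{T}}\Big)^{-1}
     \Big[\Big(\lambda_n I + \tfrac{1}{n} X_{\mathcal{T}}^\top X_{\mathcal{T}}\Big)\theta_a
          - \lambda_n\theta_a + \tfrac{1}{n} X_{\mathcal{T}}^\top \epsilon_{a,\mathcal{T}}\Big], \\
\hat\theta_{a,t} - \theta_a
  &= \Big(\lambda_n I + \tfrac{1}{n} X_{\mathcal{T}}^\top X_{\mathcal{T}}\Big)^{-1}
     \Big(\tfrac{1}{n} X_{\mathcal{T}}^\top \epsilon_{a,\mathcal{T}} - \lambda_n\theta_a\Big).
\end{align*}
```

The error therefore splits into a noise term, driven by $X_\mathcal{T}^\top\epsilon_{a,\mathcal{T}}$, and a regularization bias proportional to $\lambda_n$.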



We are using again the notation $\mathcal{T} = \mathcal{T}_{a,t-1}$ and $n = |\mathcal{T}|$. From this expression, an application of the Cauchy-Schwarz inequality and the triangle inequality leads to,



\xlieaj^9a)\ 



(/ X'j-Ca,' 
n 



< ]J xl (^KI + ^X^r^r J xt 
We introduce the following notation

$$c_{a,t} = x_t^\top\Big(\lambda_n I + \frac{1}{n} X_\mathcal{T}^\top X_\mathcal{T}\Big)^{-1} x_t. \tag{8}$$



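Note that, given $a$ and $t$, both $n$ and $\mathcal{T}$ are well specified. We can now write,

$$
\begin{aligned}
P\Big\{x_t^\dagger\hat\theta_{a,t} > x_t^\dagger\theta_a + \frac{\Delta_{\min}}{2}\Big\}
&\le P\Big\{\Big|\frac{1}{n}\sum_{\tau\in\mathcal{T}} x_t^\dagger x_\tau\,\epsilon_{a,\tau}\Big| > \frac{\Delta_{\min}}{2c_{a,t}} - \lambda_n|x_t^\dagger\theta_a|\Big\} \\
&\le P\Big\{\Big|\frac{1}{n}\sum_{\tau\in\mathcal{T}} x_t^\dagger x_\tau\,\epsilon_{a,\tau}\Big| > \frac{\Delta_{\min}}{2c_{a,t}} - \lambda_n Q\Big\}.
\end{aligned}
$$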



Since the $\epsilon_{a,\tau}$ are sub-gaussian random variables with sub-gaussian constant upper bounded by $L$ and since $|x_t^\dagger x_\tau| \le 1$, conditioned on $x_t$, $\mathcal{T}$ and $\{x_\tau\}_{\tau\in\mathcal{T}}$, each $x_t^\dagger x_\tau\,\epsilon_{a,\tau}$ is a sub-gaussian random variable and together they form a set of independent sub-gaussian random variables. One can thus apply a standard concentration inequality and obtain,



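$$P\Big\{\Big|\frac{1}{n}\sum_{\tau\in\mathcal{T}} x_t^\dagger x_\tau\,\epsilon_{a,\tau}\Big| > \frac{\Delta_{\min}}{2c_{a,t}} - \lambda_n Q\Big\} \le \mathbb{E}\Big\{2\exp\Big(-\frac{n}{2L^2}\Big(\Big(\frac{\Delta_{\min}}{2c_{a,t}} - \lambda_n Q\Big)^{+}\Big)^{2}\Big)\Big\}, \qquad (9)$$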



where both $n$ and $\mathcal{T}$ are random quantities and the operator $+$ acts on scalars as follows: $z^{+} = z$ if $z > 0$ and zero otherwise.
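The concentration step can be checked numerically. The sketch below is only an illustration under stated assumptions: the noise is taken to be Gaussian with standard deviation $L$ (one particular sub-gaussian distribution), the weights $w_i \in [-1, 1]$ stand in for the inner products $x_t^\dagger x_\tau$, and the tail bound is taken in the standard form $2\exp(-ns^2/(2L^2))$. All function names are hypothetical.

```python
import math
import random

def tail_bound(n: int, s: float, L: float) -> float:
    """Sub-gaussian tail bound 2 exp(-n s^2 / (2 L^2)) for the average of n
    independent terms w_i * eps_i with |w_i| <= 1 (assumed standard form)."""
    return 2.0 * math.exp(-n * s * s / (2.0 * L * L))

def empirical_tail(n: int, s: float, L: float, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P(|(1/n) sum_i w_i eps_i| > s) with Gaussian
    noise of standard deviation L and fixed weights w_i in [-1, 1]."""
    rng = random.Random(seed)
    # Fixed weights play the role of the inner products x_t^T x_tau.
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    hits = 0
    for _ in range(trials):
        avg = sum(w * rng.gauss(0.0, L) for w in weights) / n
        if abs(avg) > s:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    n, s, L = 50, 0.3, 1.0
    print("empirical:", empirical_tail(n, s, L, trials=2000))
    print("bound:    ", tail_bound(n, s, L))
```

Because the weights are bounded by one in absolute value, the empirical frequency should sit well below the bound; the bound is loose but decays exponentially in $n$, which is all the proof needs.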



We now upper bound $c_{a,t}$ using the following facts about the eigenvalues of any two real-symmetric matrices $M_1$ and $M_2$: if $M_1$ is positive definite then $\lambda_{\max}(M_1^{-1}) = 1/\lambda_{\min}(M_1)$, and $\lambda_{\min}(M_1 + M_2) \ge \lambda_{\min}(M_1) + \lambda_{\min}(M_2) \ge \lambda_{\min}(M_1) - \|M_2\|$.



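$$
\begin{aligned}
c_{a,t} &= x_t^\dagger\Big(\lambda_n I + \frac{1}{n}\sum_{\tau\in\mathcal{T}} x_\tau x_\tau^\dagger\Big)^{-1} x_t \\
&\le \Big(\lambda_n + \lambda_{\min}^{+}(\mathbb{E}\{x_1x_1^\dagger\}) - \Big\|\frac{1}{n}\sum_{\tau\in\mathcal{T}} x_\tau x_\tau^\dagger - \mathbb{E}\{x_1x_1^\dagger\}\Big\|^{+}\Big)^{-1}.
\end{aligned}
$$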



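Both the eigenvalue and the norm above only need to be computed over the subspace spanned by the vectors $x_t$ that occur with non-zero probability. We use the symbol $+$ to denote the restriction to this subspace. Now notice that $\|\cdot\|^{+} \le \|\cdot\|$ and, since we defined $\Sigma_{\min} \equiv \min_{i:\lambda_i > 0} \lambda_i(\mathbb{E}\{x_1x_1^\dagger\})$, we have that $\lambda_{\min}^{+}(\mathbb{E}\{x_1x_1^\dagger\}) \ge \Sigma_{\min}$. Using the following definition, $\Delta\Sigma_n \equiv n^{-1}\sum_{\tau\in\mathcal{T}} x_\tau x_\tau^\dagger - \mathbb{E}\{x_1x_1^\dagger\}$, this leads to,

$$c_{a,t} \le (\lambda_n + \Sigma_{\min} - \|\Delta\Sigma_n\|)^{-1} \le (\Sigma_{\min} - \|\Delta\Sigma_n\|)^{-1}.$$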
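The eigenvalue argument can be illustrated in dimension $d = 2$, where eigenvalues and inverses of symmetric matrices have closed forms. In the sketch below the context distribution (uniform over three unit vectors), the value of $\lambda_n$ and all function names are assumptions made for the example; they are not taken from the algorithm's specification.

```python
import math
import random

def eig_sym2(a, b, c):
    """Eigenvalues (min, max) of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mid = 0.5 * (a + c)
    rad = math.hypot(0.5 * (a - c), b)
    return mid - rad, mid + rad

def second_moment(points):
    """Entries (a, b, c) of the average outer product (1/n) sum x x^T."""
    n = len(points)
    a = sum(x * x for x, _ in points) / n
    b = sum(x * y for x, y in points) / n
    c = sum(y * y for _, y in points) / n
    return a, b, c

def check_c_bound(n, lam_n, seed=0):
    """Return (c_at, lower) where c_at = x^T (lam_n I + Sigma_hat)^{-1} x and
    lower = lam_n + Sigma_min - ||Delta Sigma_n||; the claim is c_at <= 1/lower
    whenever lower > 0."""
    rng = random.Random(seed)
    support = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]    # unit-norm contexts (assumed)
    sa, sb, sc = second_moment(support)                # exact E{x x^T} for a uniform draw
    sample = [rng.choice(support) for _ in range(n)]
    ea, eb, ec = second_moment(sample)                 # empirical second moment
    a, b, c = ea + lam_n, eb, ec + lam_n               # A = lam_n I + Sigma_hat
    x0, x1 = rng.choice(support)                       # current context x_t
    det = a * c - b * b
    c_at = (c * x0 * x0 - 2.0 * b * x0 * x1 + a * x1 * x1) / det
    sigma_min = eig_sym2(sa, sb, sc)[0]
    dev = max(abs(e) for e in eig_sym2(ea - sa, eb - sb, ec - sc))
    return c_at, lam_n + sigma_min - dev

if __name__ == "__main__":
    c_at, lower = check_c_bound(n=500, lam_n=1.0 / math.sqrt(500))
    print(f"c_at = {c_at:.3f}, 1/lower = {1.0 / lower:.3f}")
```

Here the second-moment matrix has full rank, so the restriction to the subspace of contexts with non-zero probability plays no role; the rank-deficient case would require projecting onto that subspace first.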

We now need the following Lemma. 

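Lemma 1 Let $\{X_i\}_{i=1}^{n}$ be a sequence of i.i.d. random vectors of 2-norm bounded by 1. Define $\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n} X_iX_i^\dagger$ and $\Sigma = \mathbb{E}\{X_1X_1^\dagger\}$. If $\epsilon \in (0,1)$ then,

$$P(\|\hat\Sigma - \Sigma\| > \epsilon\|\Sigma\|) \le 2e^{-C\epsilon^2 n},$$

where $C < 1$ is an absolute constant.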

For a proof see 'Introduction to the non-asymptotic anal- 
ysis of random matrices', 2011, by Roman Vershynin 
(Corollary 50). 
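To get a feel for Lemma 1, the following sketch estimates the relative error $\|\hat\Sigma - \Sigma\| / \|\Sigma\|$ for contexts drawn uniformly from the unit circle in $\mathbb{R}^2$, for which $\Sigma = I/2$ exactly. The distribution, the sample sizes and the function names are assumptions chosen for the example; the sketch does not attempt to estimate the constant $C$.

```python
import math
import random

def spectral_norm_sym2(a, b, c):
    """Spectral norm of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mid = 0.5 * (a + c)
    rad = math.hypot(0.5 * (a - c), b)
    return max(abs(mid - rad), abs(mid + rad))

def relative_cov_error(n, seed=0):
    """||Sigma_hat - Sigma|| / ||Sigma|| for n i.i.d. points on the unit circle."""
    rng = random.Random(seed)
    sxx = sxy = syy = 0.0
    for _ in range(n):
        phi = rng.uniform(0.0, 2.0 * math.pi)
        x, y = math.cos(phi), math.sin(phi)
        sxx += x * x
        sxy += x * y
        syy += y * y
    # For the uniform distribution on the circle, Sigma = I / 2 and ||Sigma|| = 1/2.
    err = spectral_norm_sym2(sxx / n - 0.5, sxy / n, syy / n - 0.5)
    return err / 0.5

if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, round(relative_cov_error(n), 4))
```

The error typically shrinks at a rate of about $1/\sqrt{n}$, consistent with a deviation probability that decays like $e^{-C\epsilon^2 n}$.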

We want to apply this lemma to produce a useful bound on the r.h.s. of (9). First notice that, conditioning on $n$, the expression inside the expectation in (9) depends through $c_{a,t}$ on $n$ i.i.d. contexts that are distributed according to the original distribution. Because of this, we can write,



] 



n=l 

\Ta.t-l\=n}). 



X E{2e"^(^"^"^)^ 



Using the following algebraic relation: if $z, w > 0$ then $((z - w)^{+})^2 \ge z^2 - 2zw$, we can now write,



\Ta,t-l\ = 



< P{|AS„| > S„,i„/2| \Taj-i\ = n} 

< P{|AE„| > I],„i„/2| ITa.t-il = n} 



,.(A„i„)^(S„i„)^ 



Using Lemma 1 we can continue the chain of inequalities,



< 2e-C(S„.„)^n/4 



r 



\Ta,t-i\ = ^^| 



-)- e e 321.2 _ 

Before we proceed we need the following lemma.

Lemma 2 If $n_c = \frac{p}{2K}\log t$ then

$$P\{|\mathcal{T}_{a,t-1}| < n_c\} \le t^{-p/(10K)}.$$

We can now write,



1 / , "^t •^T^a,T 



> -7. 

2Ca,t 



QA„i„E„ 

2e 



p Cp(S,„i„)2 qA,-„i„S,-„i„ p(A,„;„)^(S„-,;„)2 



We want this quantity to be summable over t. Hence we 
require that, 

^ 128KL^ ^ 16K ^^2^ (10) 

It is immediate to see that our proof also follows if $\Delta_{\min}$, $\Sigma_{\min}$ and $L$ are replaced by $\Delta'_{\min} = \min\{1, \Delta_{\min}\}$, $\Sigma'_{\min} = \min\{1, \Sigma_{\min}\}$ and $L' = \max\{1, L\}$ respectively.

If this is done, it is easy to see that conditions (10) are all satisfied by the $p$ stated in Theorem 2.

Since $\sum_{t=1}^{\infty} 1/t^2 \le 2$, gathering all terms together we have,



i?(T) < pA 

max 

+ A,„ax\/d/^ (^46"^ 4T'^"" + 10^ 

$$\le p\Delta_{\max}\sqrt{d} + 14\Delta_{\max}\sqrt{d}\,K e^{Q/4} + p\Delta_{\max}\sqrt{d}\log T.$$



7.2 Proof of Lemma 2

First notice that $|\mathcal{T}_{a,t-1}| = \sum_{i=1}^{t-1} z_i$ where $\{z_i\}_{i=1}^{t-1}$ are independent Bernoulli random variables with parameter $p/(Ki)$. Remember that we can assume that $i > p$ since in the beginning of Algorithm 1 we play each arm $p/K$ times.

Now write, 

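$$
\begin{aligned}
P(|\mathcal{T}_{a,t-1}| < n_c)
&= P\Big(\sum_{i=1}^{t-1}\big(z_i - p/(Ki)\big) < n_c - (p/K)\sum_{i=1}^{t-1} 1/i\Big) \\
&\le P\Big(\sum_{i=1}^{t-1}\big(-z_i + p/(Ki)\big) > -n_c + (p/K)\sum_{i=1}^{t-1} 1/i\Big) \\
&\le P\Big(\sum_{i=1}^{t-1}\big(-z_i + p/(Ki)\big) > (p/K)\log t - n_c\Big). \qquad (11)
\end{aligned}
$$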



Since $\sum_{i=1}^{t-1}\mathbb{E}\{(z_i - p/(Ki))^2\} = \sum_{i=1}^{t-1}(1 - p/(Ki))(p/(Ki)) \le (p/K)\log t$, we have that $\{-z_i + p/(Ki)\}_{i=1}^{t-1}$ are independent random variables with zero mean and sum of variances upper bounded by $(p/K)\log t$. Replacing $n_c = (p/2K)\log t$ in (11) and applying Bernstein's inequality we get,

$$P(|\mathcal{T}_{a,t-1}| < n_c) \le \exp\Big(-\frac{\frac{1}{2}(p/(2K))^2\log^2 t}{(p/K)\log t + \frac{1}{3}(p/(2K))\log t}\Big) \le t^{-p/(10K)}.$$
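The last step is a direct simplification, which the sketch below verifies numerically. It assumes Bernstein's inequality in the form $\exp\big(-\tfrac{1}{2}s^2 / (V + s/3)\big)$ for summands bounded by one, with deviation $s = (p/2K)\log t$ and variance budget $V = (p/K)\log t$ as in the text. Under these assumptions the exponent works out to $t^{-3p/(28K)}$, which is at most $t^{-p/(10K)}$. The function name and parameter values are illustrative.

```python
import math

def bernstein_bound(p: float, K: int, t: int) -> float:
    """Bernstein bound exp(-(s^2 / 2) / (V + s / 3)) with
    s = (p / 2K) log t  (deviation (p/K) log t - n_c) and
    V = (p / K) log t   (upper bound on the sum of variances)."""
    log_t = math.log(t)
    s = (p / (2.0 * K)) * log_t
    v = (p / K) * log_t
    return math.exp(-(0.5 * s * s) / (v + s / 3.0))

if __name__ == "__main__":
    p, K, t = 40.0, 2, 100
    print("Bernstein bound:", bernstein_bound(p, K, t))
    print("t^(-3p/(28K)):  ", t ** (-3.0 * p / (28.0 * K)))
    print("t^(-p/(10K)):   ", t ** (-p / (10.0 * K)))
```

Since $s = V/2$, the exponent equals $(V/8)/(7/6) = 3V/28$, which explains the closed form printed above.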

8 Conclusions 

We introduced an $\epsilon$-greedy type of algorithm that provably achieves logarithmic regret for the contextual multi-armed bandits problem with linear payoffs in the stochastic setting. Our online algorithm is both fast and uses small space. In addition, our bound on the regret scales nicely with the dimension of the contextual variables, $O(\mathrm{poly}(d)\log T)$. By means of numerical simulations we illustrate how the stochasticity of the contexts is important for our bound to hold. In particular, we show how to construct a scenario for which our algorithm does not give logarithmic regret. The reason is that the mean $n^{-1}\sum_{\tau} x_\tau x_\tau^\dagger$ that is used in estimating the parameters $\theta_a$ can "forget" previously observed contexts. Because of this, it remains an open problem to show that there are efficient algorithms that achieve $O(\mathrm{poly}(d)\log T)$ regret under reward separation ($\Delta_{\min} > 0$) in the non-stochastic setting. We believe that a possible solution might be constructing a variant of our algorithm where in $n^{-1}\sum_{\tau} x_\tau x_\tau^\dagger$ we use a more careful average of past observed contexts given the current observed context. In addition, we leave it open to produce simple and efficient online algorithms for multi-armed bandit problems under rich context models, like the one we have done here for linear payoffs.
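To make the summary concrete, the following is a minimal sketch of an $\epsilon$-greedy loop of the kind analyzed above, written for $d = 2$ so that it needs no linear algebra library. It is not a transcription of Algorithm 1. The exploration probability $\min(1, p/t)$, the use of exploration samples only when updating the estimators, the regularizer $\lambda_n = 1/\sqrt{n}$, the Gaussian noise and the context distribution are all assumptions made for illustration, and every name in the code is hypothetical.

```python
import math
import random

class ArmStats:
    """Running sums for one arm (d = 2); storage does not grow with time."""

    def __init__(self):
        self.n = 0
        self.sxx = [0.0, 0.0, 0.0]   # sums of x0*x0, x0*x1, x1*x1
        self.srx = [0.0, 0.0]        # sums of r*x0, r*x1

    def update(self, x, r):
        self.n += 1
        self.sxx[0] += x[0] * x[0]
        self.sxx[1] += x[0] * x[1]
        self.sxx[2] += x[1] * x[1]
        self.srx[0] += r * x[0]
        self.srx[1] += r * x[1]

    def estimate(self):
        """Solve (lam I + S/n) theta = b/n with lam = 1/sqrt(n) (assumed)."""
        if self.n == 0:
            return (0.0, 0.0)
        lam = 1.0 / math.sqrt(self.n)
        a = lam + self.sxx[0] / self.n
        b = self.sxx[1] / self.n
        c = lam + self.sxx[2] / self.n
        u = self.srx[0] / self.n
        v = self.srx[1] / self.n
        det = a * c - b * b
        return ((c * u - b * v) / det, (a * v - b * u) / det)

def run_epsilon_greedy(T, p, thetas, noise=0.1, seed=0):
    """Return (cumulative regret, per-arm statistics) after T steps."""
    rng = random.Random(seed)
    K = len(thetas)
    arms = [ArmStats() for _ in range(K)]
    regret = 0.0
    for t in range(1, T + 1):
        phi = rng.uniform(0.0, 2.0 * math.pi)
        x = (math.cos(phi), math.sin(phi))            # i.i.d. unit-norm context
        means = [th[0] * x[0] + th[1] * x[1] for th in thetas]
        if rng.random() < min(1.0, p / t):            # explore
            arm = rng.randrange(K)
            arms[arm].update(x, means[arm] + rng.gauss(0.0, noise))
        else:                                         # exploit current estimates
            est = [s.estimate() for s in arms]
            arm = max(range(K), key=lambda k: est[k][0] * x[0] + est[k][1] * x[1])
        regret += max(means) - means[arm]
    return regret, arms

if __name__ == "__main__":
    total, _ = run_epsilon_greedy(T=5000, p=50.0, thetas=[(1.0, 0.0), (0.0, 1.0)])
    print("cumulative regret:", round(total, 2))
```

Each arm keeps only five running sums and a counter, so the per-step cost and the memory footprint are independent of $T$, which is the property emphasized in the conclusions.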
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